Using machine learning to understand age and gender classification based on infant temperament

Age and gender differences are prominent in the temperament literature, with the former particularly salient in infancy and the latter noted as early as the first year of life. This study represents a meta-analysis utilizing Infant Behavior Questionnaire-Revised (IBQ-R) data collected across multiple laboratories (N = 4,438) to overcome limitations of smaller samples in elucidating links among temperament, age, and gender in early childhood. Algorithmic modeling techniques were leveraged to discern the extent to which the 14 IBQ-R subscale scores accurately classified participating children as boys (n = 2,298) and girls (n = 2,093), and into three age groups: youngest (< 24 weeks; n = 1,102), mid-range (24 to 48 weeks; n = 2,557), and oldest (> 48 weeks; n = 779). Additionally, simultaneous classification into age and gender categories was performed, providing an opportunity to consider the extent to which gender differences in temperament are informed by infant age. Results indicated that overall age group classification was more accurate than child gender models, suggesting that age-related changes are more salient than gender differences in early childhood with respect to temperament attributes. However, gender-based classification was superior in the oldest age group, suggesting temperament differences between boys and girls are accentuated with development. Overall, Fear emerged as the subscale contributing most notably to accurate classifications. This study leads infancy research and meta-analytic investigations more broadly in a new direction as a methodological demonstration, and also provides optimal comparative data for the IBQ-R based on the largest and most representative dataset to date.


Introduction
Although a number of approaches have been developed for the purpose of measuring temperament in childhood, including a variety of observational procedures and physiological techniques, parent report continues to be most widely used overall [1]. This is due to a number of factors, prominent among these being ease of administration and scoring as well as accessibility. Parent report also provides descriptors of child temperament across time and situations, not just a "snapshot" of reactivity and/or regulation that can be gleaned from brief laboratory observations. Although multiple temperament theories or frameworks have been proposed, Rothbart's psychobiological model is generally viewed as most widely accepted at this time [2]. This approach casts temperament as constitutionally based individual differences in reactivity and self-regulation, with constitutional referring to the relatively enduring biological make-up of the individual, influenced by heredity, maturation, and experience. Reactivity refers to the arousability of emotional, motor, and attentional responses, assessed by threshold, latency, intensity, time to peak intensity, and recovery time of reactions. Self-regulation embodies processes that can serve to modulate reactivity, such as soothability and inhibitory control [3].
Although temperament has often been delineated into three overarching factors of Negative Emotionality, Positive Affectivity/Surgency, and Regulatory Capacity/Orienting, more recent studies emphasize the narrowly defined component scales. This shift toward a fine-grained approach is a function of research demonstrating that individual scales belonging to the same overarching factor differentially predict important outcomes (e.g., behavior problems), present with growth trajectories discrepant from the overarching factors, and contribute to temperament profiles in a manner inconsistent with the overarching factor content (i.e., scales that load onto different factors contribute to the same profile and, vice versa, components of the same factor define different profiles/classes; [4][5][6][7]). The Infant Behavior Questionnaire-Revised (IBQ-R), designed to provide indicators of infant temperament and the focus of this investigation, comprises 14 fine-grained scales: Activity Level, Smiling/Laughter, Approach, High Intensity Pleasure, Perceptual Sensitivity, Vocal Reactivity, Fear, Distress to Limitations, Sadness, Falling Reactivity, Duration of Orienting, Soothability, Cuddliness/Affiliation, and Low Intensity Pleasure.

Development of temperament and age differences
Manifestations of temperament transform over development, with rapid change during infancy [8]. Positive emotionality (e.g., smiling), rarely expressed during the newborn period, is observed more reliably between ages two and three months, and increases in expression throughout the first year of life [8,9]. Levels of activity, approach, distress to limitations, and PLOS ONE  [10][11][12][13][14]. Anger reactions across infancy appear to follow a U-shaped trajectory [12,15]. The decrease in anger responses occurring between 2 and 6 months of age has been linked to greater flexibility in attention shifting [16].
In the second half of the first year, infants are likely to respond with anger when unable to grasp an attractive stimulus that has been placed out of reach, or when a caregiver has removed a forbidden object. Fear generally increases throughout the second half of the first year of life [10,12,13,14], with inhibition of approach toward novel and/or intense stimuli "coming online" [14,17]. The developmental course of attentional orienting has been described as U-shaped in the first year of life [18]. Carranza and colleagues [12], for example, noted decreases in Duration of Orienting between 6 and 9 months, followed by an increase between 9 and 12 months. Toward the end of the first year, skills associated with the development of the executive attention system may contribute to the flexibility of orienting reactions [19][20][21]. Infants also gain communication skills rapidly during the first year of life [22,23], and thus exhibit greater vocal reactivity over time.
With respect to age/developmental differences discerned via the IBQ-R, older infants obtain higher scores on Approach, Vocal Reactivity, High Intensity Pleasure, Activity Level, Perceptual Sensitivity, Distress to Limitations, and Fear, whereas younger infants' scores are higher for Low Intensity Pleasure, Cuddliness/Affiliation, and Duration of Orienting [24,25]. More recent longitudinal investigations provided further evidence of increases in Fear across the first year of life [5,26], also noting increases in Distress to Limitations and Sadness, albeit not always linear in nature. Falling Reactivity was associated with a quadratic trajectory, with increases followed by declining values later in infancy. Increasing trajectories were noted for attributes associated with Positive Affectivity/Surgency, with trends toward greater Activity Level, Smiling and Laughter, High Intensity Pleasure, Approach, Perceptual Sensitivity, and Vocal Reactivity later in infancy. Growth modeling provided evidence of nonlinear changes in Duration of Orienting, Soothability, Cuddliness, and Low Intensity Pleasure, wherein initial growth in values was followed by decreases later in infancy [5]. These findings are largely consistent with prior research relying on different measurement approaches. Although the data examined in this study are cross-sectional in nature, earlier longitudinal evaluations are informative, as their results speak to the importance of age in shaping temperament presentations and, vice versa, to temperament features as predictors of infant age. It should be noted that no study to date has explored the latter, that is, used temperament dimensions to classify infants with respect to their age, likely due to sample size limitations and only recently available methodological advances in empirically based classification techniques.

Gender differences in temperament
Although a number of gender differences in temperament have been reported for older children and adults, fewer exist for children younger than one year of age [8,25,27,28]. Differences in infancy have been limited to activity level and fear/behavioral inhibition. Higher activity level and approach are evident in boys [29,30], with girls exhibiting greater hesitation in approaching novel objects [14,31]. Campbell and Eaton [29] applied meta-analytic procedures to summarize 46 studies addressing activity level in infancy, estimating the size of the gender difference at 0.2 standard deviations based on objective measures (parent-report measures estimated the difference to be smaller). Gender differences in approach-withdrawal have been reported for samples from different countries [30,32,33,34], with parents rating boys higher in approach. Martin et al. [31] reported a large and significant gender difference for distress to novelty, with 6-month-old girls receiving higher scores.
Gender differences also have been documented with the IBQ-R, as boys received higher scores on Activity and High Intensity Pleasure, and girls higher scores on Fear [24,25,35,36]. Infant gender also predicted intercept values of Fear trajectories, with girls demonstrating higher levels at 4 and 6 months [5,26]. Girls also started out at lower values (i.e., intercept estimates) for Activity Level, Approach, and High Intensity Pleasure. As with research on age/developmental differences, gender-related temperament studies have only compared temperament for boys and girls, not considering gender classification based on temperament features. Importantly, age- and gender-based temperament distinctions have not been considered jointly to discern whether age-related changes inform gender differences.

Present study
In this study, we leveraged IBQ-R data collected across multiple laboratories (N = 4,438) to further investigate age and gender differences in infancy, addressing yet unanswered questions. Specifically, algorithmic modeling techniques were used to discern the extent to which the 14 IBQ-R subscale scores (referred to as features) accurately classified participating children as boys (n = 2,298) or girls (n = 2,093; 47 children were missing gender data) and into three age groups: youngest (< 24 weeks; n = 1,102), mid-range (24 to 48 weeks; n = 2,557), and oldest (> 48 weeks; n = 779), because of previously noted gender-based variability [14,29,30,31,32,33,34] and significant developmental differences among these age groups (e.g., with respect to brain growth and maturation; [37,38]). This study addresses an important gap in research, being the first to consider temperament attributes as determinants of age and gender groupings, quantifying the extent to which early reactivity and regulation provide the features necessary for accurate prediction. Importantly, this work also allows for simultaneous classification of age and gender categories, providing an opportunity to consider the extent to which gender differences are informed by infant age, and to our knowledge, this is the first study to do so. That is, despite prior demonstrations of reliable age and gender differences in temperament, the two classifications have not been considered jointly, examining whether gender differences were age dependent in a single investigation. Moreover, this effort provides a new direction for infancy and temperament research, serving as a methodological demonstration of machine learning applications, not yet utilized in these areas of scientific inquiry.
This meta-analytic, data-driven effort is the first to rely on advanced machine learning techniques using temperament features to classify infants into age and gender groups, rather than compare temperament of children who vary in age and gender, considering these classifications simultaneously. This cross-laboratory effort also overcomes prior limitations associated with small samples that were not representative, producing results circumscribed in terms of generalizability.

Measures
The Infant Behavior Questionnaire-Revised (IBQ-R; [24]). This parent-report measure of temperament was developed for infants between 3 and 12 months of age. The IBQ-R contains 191 items, which yield 14 scales: Activity Level, Smiling/Laughter, Approach, High Intensity Pleasure, Perceptual Sensitivity, Vocal Reactivity (loading onto Positive Affectivity/Surgency); Fear, Distress to Limitations, Sadness, Falling Reactivity (Negative Emotionality); Duration of Orienting, Soothability, Cuddliness/Affiliation, Low Intensity Pleasure (Regulatory Capacity/Orienting). Individual items are rated on a 7-point scale reflecting the frequency of occurrence of the behavior in the past week (two weeks for less frequent events, such as encounters with unfamiliar settings/adults). Reliability of the IBQ-R has been supported for mothers and fathers, as well as samples from different cultures, with Cronbach's α values ranging from .77 to .96 [39][40][41]. Evidence supports the predictive and construct validity of IBQ-R scores [42][43][44]. Cronbach's α values for the 14 subscales included in the current analysis, derived from 29 datasets, ranged from .74 to .89 (mean α = .82). These temperament features were used to classify children into gender and age categories via machine learning algorithms.

Procedure
Data sets (N = 29) were acquired by emailing researchers who requested the IBQ-R or published research using the instrument between 2006  Contributors were asked to provide item level data from the IBQ-R as well as infant age, gender, and race. For all participants, the IBQ-R was completed by the infant's mother. See Table 1 for a brief description of the samples.

Analytic strategy
Descriptive statistics across gender and age groups were computed first (Table 2). We then constructed a model framework allowing us to assess the utility of fine-grained temperament dimensions with respect to gender and age classifications. This framework resulted in a total of five model types: 1) gender: boys vs. girls; 2) age groups: youngest (< 24 weeks) vs. mid-range (24 to 48 weeks) vs. oldest (> 48 weeks) infants; and gender by age group analyses: 3) boys vs. girls in the youngest age group; 4) boys vs. girls in the mid-range age group; 5) boys vs. girls in the oldest age group. Classification of infant gender within age groups allows us to determine whether gender-based classification is more accurate for younger vs. older infants.
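The five model types can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout (keys `scores`, `gender`, `age_weeks`), the function names, and the choice to keep children with missing gender in the age model only are assumptions made for illustration. The age cut-offs follow the group definitions given in the text.

```python
from collections import defaultdict


def age_group(age_weeks):
    """Map age in weeks to the three age groups described in the text."""
    if age_weeks < 24:
        return "youngest"
    if age_weeks <= 48:
        return "mid-range"
    return "oldest"


def build_model_datasets(records):
    """Arrange records into the five classification tasks.

    `records` is a hypothetical list of dicts with keys 'scores'
    (the 14 subscale scores), 'gender' ('boy', 'girl', or None),
    and 'age_weeks'. Each task is a list of (features, label) pairs.
    """
    tasks = defaultdict(list)
    for r in records:
        group = age_group(r["age_weeks"])
        # Model type 2: three-way age group classification.
        tasks["age"].append((r["scores"], group))
        if r["gender"] is None:
            # Assumption: missing gender excludes a child from gender models only.
            continue
        # Model type 1: gender across all ages.
        tasks["gender"].append((r["scores"], r["gender"]))
        # Model types 3-5: gender within each age group.
        tasks["gender|" + group].append((r["scores"], r["gender"]))
    return dict(tasks)


if __name__ == "__main__":
    demo = [
        {"scores": [4.0] * 14, "gender": "boy", "age_weeks": 20},
        {"scores": [3.5] * 14, "gender": "girl", "age_weeks": 30},
        {"scores": [5.1] * 14, "gender": None, "age_weeks": 52},
    ]
    for name, rows in build_model_datasets(demo).items():
        print(name, len(rows))
```

Keeping the within-age gender tasks as separate datasets mirrors the design described above, in which gender classification is evaluated independently in each age group.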
Established machine learning techniques, methodologically rigorous and shown to provide reliable/reproducible results, were used in this study (e.g., [45,46]). Specifically, for all models, we used random partitioning with repeated 10-fold cross-validation: a training dataset comprising 70% of the sample, with the remaining 30% reserved as a hold-out (testing) dataset to evaluate the predictive utility of the trained models. A total of 11 different algorithms were considered for each model type: (1) linear discriminant analysis; (2) generalized linear modeling; (3) support vector machines; (4) K-nearest neighbor; (5) naïve Bayes; (6) classification and regression trees; (7) C5.0 classification; (8) bootstrapped aggregated trees; (9) ensembled decision trees (Random Forest; [47,48]); (10) gradient boosting; and (11) multiclass adaptive boosting (AdaBoost). These algorithms were chosen based on their applicability and widespread use in the classification modeling literature [45,46], and to achieve the most robust and replicable results discernible across multiple modeling techniques. The aforementioned models were then compared to discern the most effective classification of infant gender and age. Misclassification provides a simple posterior assessment of model classification based on contingency tables and is often used for initial classification and model accuracy evaluation. Accuracy indicators, reported herein, represent the complement of misclassification rates (i.e., 1 minus the misclassification rate). Cohen's kappa coefficient assesses reliability of categorization; it incorporates chance agreement, is normalized, and can range from -1 to 1. Kappa values will typically be lower than overall accuracy indicators, as kappa represents a more conservative estimate given its assessment of accuracy compared to random assignment.
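The partitioning scheme described above can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the seed, and the number of repeats (three) are assumptions, since the text specifies only the 70/30 split and repeated 10-fold cross-validation.

```python
import random


def holdout_split(n, train_frac=0.70, seed=0):
    """Randomly assign n cases to training (70%) and hold-out (30%) sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_frac)
    return idx[:cut], idx[cut:]


def repeated_kfold(train_idx, k=10, repeats=3, seed=0):
    """Yield (fit_idx, validation_idx) pairs for repeated k-fold CV.

    The number of repeats is an assumption; it is not stated in the text.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(train_idx)
        rng.shuffle(shuffled)
        # Deal the shuffled cases into k folds of near-equal size.
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield fit, folds[i]


if __name__ == "__main__":
    train, test = holdout_split(4438)
    print(len(train), len(test))
    print(sum(1 for _ in repeated_kfold(train)))
```

In this arrangement, cross-validation is carried out only within the training portion to tune and compare models, and the hold-out cases are touched once, for the final evaluation.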
The area under an ROC curve (area under the curve, or AUC) is a third metric used to evaluate the accuracy of binary classifiers, which encapsulates both Type I and Type II errors [49]. However, ROC-AUC is limited insofar as it does not take predicted probability values and goodness of fit of evaluated models into account. While all three indicators provide unique assessments of classification accuracy, overall misclassification rate (or its complement, accuracy) is the most broadly used metric for classification evaluation [50]. For all of the model classification indices, higher values (i.e., closer to 1) can be considered superior, indicative of better performance.
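For readers less familiar with these three indicators, the following sketch shows how each can be computed from predictions. It is a plain-Python illustration, not the authors' code; the function names and the toy labels and scores are hypothetical. The AUC is computed here in its rank-based form, as the proportion of positive-negative pairs in which the positive case receives the higher score (ties count one half).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Proportion correctly classified (1 minus the misclassification rate)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) if chance agreement p_e equals 1.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal totals of the contingency table.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def roc_auc(y_true, scores, positive):
    """Rank-based AUC for a binary outcome; `scores` rate the positive class."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = ["boy", "boy", "girl", "girl"]
    y_pred = ["boy", "girl", "girl", "girl"]
    girl_scores = [0.2, 0.8, 0.7, 0.9]  # hypothetical predicted scores for "girl"
    print(accuracy(y_true, y_pred))
    print(cohens_kappa(y_true, y_pred))
    print(roc_auc(y_true, girl_scores, positive="girl"))
```

In this toy example, accuracy is 0.75 whereas kappa is 0.50, illustrating why kappa is generally the more conservative of the two indicators.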

Results
Overall, classification accuracy was superior for age relative to gender categories, based on misclassification rates (i.e., accuracy indicators), Kappa, and area under the curve (AUC) indicators (Table 3A).
Specifically, across all algorithmic models, age-based classification outperformed gender-based classification for all classification outcomes.
Gender classification was performed within the three infant age groups next (Table 3B), with classification effectiveness for gender generally superior in the oldest age group (> 48 weeks). That is, oldest age group classification models consistently outperformed others based on the AUC, and this was the case for the majority of classification algorithms with respect to accuracy and Kappa indicators. Next, we focused on the AUC, especially informative in capturing differences for gender classification models across age groups because of its longstanding widespread use for comparative purposes in the machine learning classification literature [51] and visualization capabilities (Figs 1-3). AUC gender classification indicators were superior for the oldest age group, yielding higher values across different algorithmic models, illustrated in Fig 3.

Discussion
We set out to leverage existing IBQ-R datasets from multiple laboratories (N = 4,438) to address an important gap in research by investigating age and gender classifications in early childhood, and overcoming limitations of the published studies such as small sample sizes that cannot be considered representative or provide widely generalizable results. Relying on algorithmic modeling techniques, 14 IBQ-R subscale scores served as features used to classify participating children as boys (n = 2,298) and girls (n = 2,093), and into three age groups: youngest (< 24 weeks; n = 1,102), mid-range (24 to 48 weeks; n = 2,557), and oldest (> 48 weeks; n = 779). Importantly, this approach allowed us to simultaneously classify infants into age and gender categories, providing an opportunity for the first time to consider the extent to which gender differences are informed by infant age. This study also makes an important contribution to the literature as a novel methodological demonstration. That is, the present application of machine learning algorithms provides a new direction for infancy and temperament research, as well as meta-analytic investigations more broadly. Results based on accuracy indicators (the complement of misclassification rates), Cohen's kappa coefficients, and AUC (incorporating sensitivity and specificity parameters) demonstrated that temperament features provided superior classification of age groups relative to gender, which is consistent with the existing literature insofar as age effects have generally been more robust (e.g., not dependent on methodology; [5,26,52]). As noted, gender differences in infancy have been largely limited to activity level and fear/behavioral inhibition, with higher activity level and approach reported for boys [29,30] and greater fear/behavioral inhibition for girls [14,25,31,35,36]. These gender differences are somewhat controversial due to a lack of consensus regarding their origin (i.e., biologically based or largely a function of socialization; [53]) and questions regarding the role of parental expectations. That is, parents could rate boys and girls differently not due to actual variability in behavior but as a function of their own culturally influenced ideas about what is typical behavior in boys vs. girls. This explanation cannot be ruled out completely, although existing research suggests that gender differences are not entirely dependent on methodology (i.e., have been identified via behavioral observations along with parent report; [33,52]). Importantly, results of gender classification within age groups suggest that it is most effective for the oldest age group, in line with the literature that indicates gender differences in temperament attributes become more pronounced with age [54].
Although a number of factors could be contributing to this pattern of results (accentuated gender differences in temperament with increasing age and, conversely, more accurate classification of gender with temperament features for the oldest participants), socialization is often described as critical among these. The primary mechanism invoked in such explanations involves the infants' interactional history, and is consistent with literature that indicates mothers respond differently to their sons and daughters [55][56][57][58][59], presenting with different affordances as social interaction partners (e.g., [60]). Over time, such differences could result in divergent trajectories with respect to temperament due to differences in socialization goals/approaches for boys vs. girls. Specifically, parents may prioritize relationship orientation for daughters, but competence and autonomy for sons [61][62][63]. These and other socialization-related pathways may be responsible for the stronger temperament-based classification of boys and girls later in infancy observed herein.
At the same time, gender is viewed as a marker for a host of sex-linked distinctions in physiological processes. For example, prenatal exposure to high levels of androgen is predictive of later behavior problems, primarily of the externalizing type (e.g., ADHD; [64]), and used to explain early vulnerability observed in boys with respect to this set of problems [65]. Postpartum biological effects are also possible, for example via testosterone increases for boys in infancy, referred to as "mini-puberty," peaking by the second month and returning to baseline at about 6 months [66]. Sex-linked differentiation in brain structures and functions occurs with maturation, resulting in greater discrepancies with age. For example, Goldstein et al. [67] reported that the amygdala tends to be larger in males and the hippocampus larger in females (see Hines [68] for a related review).
Follow-up analyses outlining feature importance for classification models were performed for the Ensembled Decision Trees (Random Forest) to aid interpretation of the observed results. Random Forest methods provide an effective mechanism for feature selection and importance, using tree-based mechanisms to rank node classification via the mean decrease in Gini impurity, i.e., the probability that a random sample in a particular tree node would be mislabeled using the distribution of the node sample, averaged across all trees [69]. Figures provided in Supplemental Materials (S1-S3 Figs) demonstrate that while Fear was the most important feature in distinguishing boys and girls for the youngest and mid-range age groups, for the oldest infants, Low Intensity Pleasure was most influential. In fact, for youngest infants (S3 Fig), all three distress-related scales (Fear, Distress to Limitations, Sadness) were of primary importance in classifying infants accurately by gender via the Random Forest algorithm. Positive emotionality and regulatory dimensions of temperament (e.g., Falling Reactivity, Approach) begin to take on greater importance for mid-range and oldest infants. Notably, certain temperament features detracted from model accuracy in classifying infants by gender (i.e., associated with lowest negative importance values), particularly Cuddliness, Vocal Reactivity, and Smiling and Laughter in the youngest age group and Smiling and Laughter, Perceptual Sensitivity, and Activity in the oldest age group. These results identify the temperament attributes that did not differentiate boys and girls effectively, and it is of interest that the list of these poorly differentiating features varied by age.
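The Gini-based importance measure described above can be made concrete with a small sketch. It is illustrative only: the function names and toy labels are hypothetical, and it shows a single split rather than a full Random Forest. In a forest, a feature's importance is obtained by accumulating such decreases (weighted by the share of cases reaching each node) over every node that splits on that feature, and then averaging across trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Probability that a randomly drawn case in the node is mislabeled
    when labels are assigned according to the node's class distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Decrease in Gini impurity achieved by splitting `parent` into
    `left` and `right` child nodes, weighting children by their size."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


if __name__ == "__main__":
    # Hypothetical node with 4 boys and 4 girls, split on one subscale score.
    parent = ["boy"] * 4 + ["girl"] * 4
    left = ["boy"] * 3 + ["girl"]      # e.g., cases below the split threshold
    right = ["boy"] + ["girl"] * 3     # e.g., cases at or above the threshold
    print(gini_impurity(parent))
    print(impurity_decrease(parent, left, right))
```

A split that separates boys and girls perfectly in this toy node would remove all of the parent's impurity (0.5), whereas the partial separation shown removes 0.125.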
When the most important features were considered for age classification and gender classification models only, Fear again emerged as the critical dimension, which is in line with the extensive literature documenting the developmental progression as well as gender differences for this domain of temperament [2,13,14,26,54].
This work is not without limitations, chief among these our reliance on a single method (i.e., parent report) in the assessment of infant temperament. Future studies should aggregate datasets providing different sources of information, including behavioral observations and physiological measures, such as cortisol reactivity, heart rate variability/respiratory sinus arrhythmia, and/or frontal alpha asymmetry ascertained via electroencephalogram (EEG) recordings. In addition, the outcomes examined in this study were limited to child gender and age. Future studies with older children should conduct classification analyses with additional dependent variables, particularly symptom and disorder classifications (e.g., clinical/subclinical/asymptomatic ADHD). It should be noted that we did not consider classification based on race/ethnicity because of a far more limited literature suggesting these differences can be discerned on the basis of temperament, and future research should examine related models as relevant studies accumulate. Finally, the present modeling approach could be extended and potentially improved by applying ensemble modeling approaches (i.e., using multiple algorithms simultaneously), as opposed to relying on singular modeling frameworks.
This study underscores the importance of meta-analytic investigations and cross-laboratory collaborations, providing previously elusive answers to questions, such as those related to intersections of gender and age in temperament development, that have not been previously addressed. Because of the large cross-laboratory sample included herein, this study provides optimal comparative data for the IBQ-R (Table 2), which has emerged as a widely used infant temperament assessment tool. Importantly, the present investigation serves as a methodological illustration for application of machine learning techniques in infancy and temperament research, as well as developmental science more broadly. Given the propensity for differing algorithmic methods to have strengths and weaknesses that may bias predictive outcomes and classification accuracy, we selected 11 established algorithmic modeling and classification techniques to quantify the most robust outcomes, simultaneously demonstrating the viability of machine learning approaches in this area of scientific inquiry. Results of this study make an important contribution to developmental temperament research, demonstrating effective age group classification on the basis of fine-grained temperament features, and indicating more effective gender classification for the oldest age group, with multiple implications for future mechanistic research examining potential socialization and biological contributors.